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Abstract 

In this paper, we propose and analyze a trust-region model-based algorithm for solving 
unconstrained stochastic optimization problems. Our framework utilizes random models of an 
objective function /(x), obtained from stochastic observations of the function or its gradient. 
Our method also utilizes estimates of function values to gauge progress that is being made. The 
convergence analysis relies on requirements that these models and these estimates are sufficiently 
accurate with high enough, but fixed, probability. Beyond these conditions, no assumptions are 
made on how these models and estimates are generated. Under these general conditions we show 
an almost sure global convergence of the method to a first order stationary point. In the second 
part of the paper, we present examples of generating sufficiently accurate random models under 
biased or unbiased noise assumptions. Lastly, we present some computational results showing 
the benefits of the proposed method compared to existing approaches that are based on sample 
averaging or stochastic gradients. 


1 Introduction 

Derivative free optimization (DFO) [8] has recently grown as a field of nonlinear optimization which 
addressed optimization of black-box functions, that is functions whose value can be (approximately) 
computed by some numerical procedure or an experiment, while their closed-form expressions 
and/or derivatives are not available and cannot be approximated accurately or efficiently. Although 
the role of derivative-free optimization is particularly important when objective functions are noisy, 
traditional DFO methods have been developed primarily for deterministic functions. The fields of 
stochastic optimization [18, 24] and stochastic approximation [28] on the other hand focus on 
optimizing functions that are stochastic in nature. Much of the focus of these methods depend 
on the availability and use of stochastic derivatives, however, some work has addressed stochastic 
black box functions, typically by some sort of a finite differencing scheme [16]. 
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In this paper, using methods developed for DFO, we aim to solve 

min f{x) (1) 

where f{x) is a function which is assumed to be smooth and bounded from below, and whose value 
can only be computed with some noise. Let / be the noisy computable version of /, which takes 
the form 

f{x) = f{x,e) 

where the noise e is a random variable. 

In recent years, some DFO methods have been extended to and analyzed for stochastic functions 
[9]. Additionally, stochastic approximation methodologies started to incorporate techniques from 
the DFO literature [5]. The analysis in all that work assumes some particular structure of the 
noise, including the assumption that the noisy function values give an unbiased estimator of the 
true function value. 

There are two main classes of methods in this setting of stochastic optimization: stochastic 
gradient (SO) methods (such as the well known Robbins-Monro method) and sample averaging 
(SA) methods. The former (SG) methods roughly work as follows; they obtain a realization of 
an unbiased estimator of the gradient at each iteration and take a step in the direction of the 
negative gradient. The step sizes progressively diminish and the iterates are averaged to form a 
sequence that converges to a solution. These methods typically have very inexpensive iterations, 
but exhibit slow convergence and are strongly dependent on the choice of algorithmic parameters, 
particularly the sequence of step sizes. Many variants exist that average the gradient information 
from past iterations and are able to accept sufficiently small, but nondecreasing step sizes [15, 1]. 
However, the convergence remains slow, and parameter tuning remains necessary in most cases. 
Moreover, the majority of these methods have been developed exclusively for convex functions and 
may not converge in non convex settings. An exception is the randomized stochastic (accelerated) 
gradient (RS(A)G) method presented in [13], [12]. This method employs random stopping criteria 
to provide theoretical first-order convergence even in nonconvex cases, but does not appear to be 
very practical. 

The second class of methods, (SA), is based on sample averaging of the function and gradient es¬ 
timators, which is applied to reduce the variance of the noise. These methods repeatedly sample the 
function value at a set of points in hopes to ensure sufficient accuracy of the function and gradient 
estimates. For a thorough introduction and references therein, see [20]. The optimization method 
and sampling process are usually tightly connected in these approaches, hence, again, algorithmic 
parameters need to be specially chosen and tuned. These methods tend to be more robust with 
respect to parameters and enjoy faster convergence at a cost of more expensive iterations. However, 
none of these methods are applicable in the case of biased noise and they suffer significantly in the 
presence of outliers. 

The goal of this paper is to show that a standard efficient unconstrained optimization method, 
such as a trust region method, can be applied, with very small modifications, to stochastic nonlinear 
(not necessarily convex) functions and can be guaranteed to converge to first order stationary points 
as long as certain conditions are satisfied. We present a general framework, where we do not specify 
any particular sampling technique. The framework is based on the trust region DFO framework [8], 
and its extension to probabilistic models [2]. In terms of this framework and the certain conditions 
that must be satisfied, we essentially assume that 
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• the local models of the objective function constructed on each iteration satisfy some first 
order accuracy requirement with sufficiently high probability, 

• and that function estimates at the current iterate and at a potential next iterate are sufficiently 
accurate with sufficiently high probability. 

The main novelty of this work is the analysis of the framework and the resulting weaker, more 
general, conditions compared to prior work. In particular, 

• we do not assume that the probabilities of obtaining sufficiently accurate models and estimates 
are increasing (they simply need to be above a certain constant) and 

• we do not assume any distribution of the random models and estimates. In other words, if a 
model or estimate is inaccurate, it can be arbitrarily inaccurate, i.e. the noise in the function 
values can have nonconstant bias. 

It is also important to note that while our framework and model requirements are borrowed from 
prior work on DFO, this framework applies to derivative-based optimization as well. Later in the 
paper we will discuss different settings which will fit into the proposed framework. 

This paper consists of two main parts. In the first part we propose and analyze a trust region 
framework, which utilizes random models of f{x) at each iteration to compute the next potential 
iterate. It also relies on (random, noisy) estimates of the function values at the current iterate and 
the potential iterate to gauge the progress that is being made. The convergence analysis then relies 
on requirements that these models and these estimates are sufficiently accurate with sufficiently 
high probability. Beyond these conditions, no assumptions are made about how these models and 
estimates are generated. The resulting method is a stochastic process that is analyzed with the 
help of martingale theory. The method is shown to converge to first order stationary points with 
probability one. 

In the second part of the paper we consider various scenarios under different assumptions on the 
noise inducing component e and discuss how sufficiently accurate random models can be generated. 
In particular, we show that in the case of unbiased noise, that is when E[/(x,e)] = f{x) and 
Var[/(x,e)] < cr^ < oo for all x, sample averaging techniques give us sufficiently accurate models. 
Although we will prove convergence under the mentioned framework that essentially says we have 
the ability to compute both sufficiently accurate models and estimates with constant, separate 
probabilities, it is not necessarily easy to estimate what these probabilities ought to be for a given 
problem. While we provide some guidance on the selection of sampling rates in an unbiased noise 
setting in Section 5, our numerical experiments show that the bounds on probabilities suggested 
by our theory to be necessary for almost sure convergence are far from tight. 

We also discuss the case where E[/(x,e)] / f{x), and where the noise bias may depend on x 
or on the method of computation of the function values. One simple setting, which is illustrative, 
is as follows. Suppose we have an objective function, which is computed by a numerical process, 
whose accuracy can be controlled (such as tightening a convergence criterion within this numerical 
process). Suppose now that this numerical process involves some random component (such as 
sampling from some large data and/or utilizing a randomized algorithm). It may be known that 
with sufficiently high probability this numerical process produces a sufficiently accurate function 
value - however, with some small (but nonzero) probability the numerical process may fail and 
hence no reasonable value is guaranteed. Moreover, such failures may become more likely as more 
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accurate computations are required (for instance because an upper bound on the total number of 
iterations is reached inside the numerical process). Hence the probability of failure may depend on 
the current iterate and state of the algorithm. Here we simply assume that such failures do not occur 
with probability higher than some constant (which will be specihed in our analysis), conditioned 
on the past. However, we do not assume anything about the inaccurate function values. As we will 
demonstrate later in this paper, in this setting, E[/(x,e)] / fix). 

1.1 Comparison with related work 

There is a very large volume of work on SA and SG, most of which is quite different from our 
proposed analysis and method. However, we will mention a few works here that are most closely 
related to this paper and highlight the differences. The three methods existing in the literature 
we will compare with are by Deng and Ferris [9], SPSA (simultaneous perturbations stochastic 
approximation) [26, 27], and SCSR (sampling controlled stochastic recursion) [14]. These three 
settings and methods are most closely related to our work because they all rely on using models 
of the objective function that can both incorporate second-order information and whose accuracy 
with respect to a “true” model can be dynamically adjusted. In particular, Deng and Ferris apply 
the trust-region model-based derivative free optimization method UOBYQA [21] in a setting of 
sample path optimization [23]. In [26] and [27], the author applies an approximate gradient descent 
and Newton method, respectively, with gradient and Hessian estimates computed from specially 
designed finite differencing techniques, with decaying finite differencing parameter. In [14] a very 
general scheme is presented, where various deterministic optimization algorithms are generalized as 
stochastic counterparts, with the stochastic component arising from the stochasticity of the models 
and the resulting step of the optimization algorithm. We now compare some key components 
of these three methods with those of our framework, which we hereforth refer to as STORM 
(STochastic Optimization with Random Models). 

Deng and Ferris: The assumptions of the sample path setting are roughly as follows; on each 
iteration k, given a collection of points = {x\,... ,Xp} one can compute noisy function values 
f{x\,£^),... ,fixp,£^). The noisy function values are assumed to be realizations of an unbiased 
estimator of true values /(xf),..., /(Xp). Then, using multiple, say N^, realizations of e^, average 
function values /'^*(xj),..., f^^{Xp) can be computed. A quadratic model m^*(x) is then fit into 
these function values, and so a sequence of models {m^''(x)} is created using a nondecreasing 
sequence of sampling rates {A^}. The assumption on this sequence of models is that each of them 
satisfies a sufficient decrease condition (with respect to the true model of the true function /) 
with probability 1 — such that < oo. The trust region maintenance follows the usual 

scheme like that in UOBYQA, hence the steps taken by the algorithm can be increased or decreased 
depending on the observed improvement of the function estimates. 

SPSA: The first order version of this method assumes that /(x, e) is an unbiased estimate of 
fix), and the second order version, 2SPSA, assumes that an unbiased estimate of V/(x), 5 '(x,e) G 
can be computed. Gradient (in the first order case) and Hessian (in the second order case) 
estimates are constructed using an interesting randomized finite differencing scheme. The finite 
difference step is assumed to be decaying to zero. An approximate steepest descent direction or 
approximate Newton direction are then constructed and a step of length tk is taken along this 
direction. The sequence {tk} is assumed to be decaying in the usual Robbins-Monro way, that is 
tk —^ 0, tk = oo. Hence, while no increase in accuracy of the models is assumed (they only 
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need to be accurate in expectation), the step size parameter and the hnite differencing parameter 
need to be tuned. Decaying step sizes often lead to slow convergence, as has been observed often 
in stochastic optimization literature. 

SCSR: This is a very general scheme which can include multiple optimization methods and 
sampling rates. The key ingredients of this scheme are a deterministic optimization method, and 
a stochastic variant that approximates it. The stochastic step (recursion) is assumed to be a 
sufficiently accurate approximation of the deterministic step with increasing probability (the prob¬ 
abilities of failure for each iteration are summable). This assumption is stronger than the one in 
this paper. In addition, another key assumption made for SCSR is that the iterates produced by 
the base deterministic algorithm converge to the unique optimal minimizer x*. Not only we do not 
assume here that the minimizer/stationary point is unique, but we also do not assume a priori that 
the iterates form a convergent sequence, since they may not do so in a non convex setting, while 
every iterate subsequence converges to a stationary point. 

STORM: Like the Deng and Ferris method, we utilize a trust-region, model-based framework, 
where the size of the trust region can be increased or decreased according to empirically observed 
function decrease and the size of the observed approximate gradients. The desired accuracy of the 
models is tied only to the trust region radius in our case, while for Deng and Ferris, it is tied to 
both the radius and the size of true model gradients (the second condition is harder to ensure). In 
either method, this desired accuracy is assumed to hold with some probability - in STORM, this 
probability remains constant throughout the progress of the algorithm, while for Deng and Ferris 
it has to converge to 1 sufficiently rapidly. 

There are three major advantages to our results. First of all, in the case of unbiased noise, the 
sampling rate is directly connected to the desired accuracy of the estimates and the probability with 
which this accuracy is achieved. Hence, for the STORM method, the sampling rate may increase or 
decrease according to the trust region radius, eventually increasing only when necessary, i.e. when 
the noise become dominating. For all the other methods listed here, the sampling rate is assumed 
to increase monotonically. Secondly, in the case of biased noise, we can still prove convergence of 
our method, as long as the desired accuracy is achieved with a fixed probability. In other words, we 
allow for the noise to be arbitrarily large with a small, but fixed probability, on each iteration. This 
allows us to consider new models of noise which cannot be handled by any of the other methods 
discussed here. In addition, STORM incorporates first and second order models without changing 
the algorithm - the step size parameter (i.e., the trust region radius) and other parameters of the 
method are chosen almost identically to the standard practices of the trust region methods, which 
have proved to be very effective in practice for unconstrained nonlinear optimization. In Section 
6 we show that the STORM method is very effective in different noise settings and is very robust 
with respect to sampling strategies. 

Finally, we want to point to an unpublished work [3], which proposes a very similar method to 
the one in this paper. Both methods were developed based on the trust region DFO method with 
random models for deterministic functions analyzed in [2] and extended to the stochastic setting. 
Some of the assumptions in this paper were inspired by an early version of [3]. However, the 
assumptions and the analysis in [3] are quite different from the ones in this paper. In particular, 
they rely on the assumption that f{x,s) is an unbiased estimate of f{x), hence their analysis does 
not extend to the biased case. Also they assume that the probability of having an accurate model 
at the k-th iteration is at least 1 — ak, such that ak —?• 0, while for our method this probability can 
remain bounded away from zero. Similarly, they assume that the probability of having accurate 
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function estimates at the k-th. iteration also converges to 1 sufficiently rapidly, while in our case it 
is again constant. Their analysis, as a result, is very different from ours, and does not generalize 
to various stochastic settings (they only focus on the derivative free setting with additive noise). 
The advantage of their method is that they do not need to put a restriction on acceptable step 
sizes, when the norm of the gradient of the model is small. We, on the other hand, impose such 
restriction in our method and use it in the proof of our main result. However, as we discuss later 
in the paper this restriction can be relaxed at the cost of more complex algorithm and analysis. In 
practice, we do not implement this restriction, hence our basic implementation is virtually identical 
to that in [3] except that we implement a variety of model building strategies, while only one such 
strategy (regression models based on randomly rotated orthogonal samples sets) is implemented in 
[3]. Thus we do not directly compare the empirical performance of our method with the method 
in [3] since we view them as more or less the same method. 

We conclude this section by introducing some frequently used notations and their meanings. 
The rest of the paper is organized as follows. In Section 2 we introduce the trust region framework, 
followed by Section 3, where we discuss the requirements on our random models and function 
estimates. The main convergence results are presented in Section 4. In Section 5 we discuss various 
noise scenarios and how sufficiently accurate models and estimates can be constructed in these 
cases. Finally, we present computational experiments based on these various noise scenarios in 
Section 6. 

Notations. Let || • || denote the Euclidean norm and B(x, A) denote the ball of radius A around x, 
i.e., B{x, A) : {y ; \\x — y\\ < A}. Q denotes the probability sample space, according to the context, 
and a sample from that space is denoted by a; G fl. As a rule, when we describe a random process 
within the algorithmic framework, uppercase letters, e.g. the k-th iterate X^, will denote random 
variables, while lowercase letters will denote realizations of the random variable, e.g. x^ = Xk{oj) 
is the k-th iterate for a particular realization of our algorithm. 

We also list here, for convenience, several constants that are used in the paper to bound various 
quantities. These constants are denoted by k with subscripts indicating quantities that they are 
meant to bound. 

Kef “error in the function value”, 

Keg “error in the gradient”, 

KEef “expectation of the error in the function value”, 

Kfed “fraction of Cauchy decrease”, 
f^bhm “bound on the Hessian of the models”, 

Ket “error in Taylor expansion”. 

2 Trust Region Method 

We consider the trust-region class of methods for minimization of unconstrained functions. They 
operate as follows: at each iteration k, given the current iterate Xk and a trust-region radius 6k, a 
(random) model mk{x) is built, which serves as an approximation of f{x) in B{xk,6k)- The model 
is assumed to be of the form 


rukixk + s) = fk -h gjs -h s'^HkS. 


( 2 ) 
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It is possible to generalize our framework to other forms of models, as long as all conditions on the 
models, described below, hold. We consider quadratic models for simplicity of the presentation and 
because they are the most common. The model mk{x) is minimized (approximately) in B{xk,Sk) 
to produce a step Sk and (random) estimates of f{xk) and f{xk + Sk) are obtained, denoted by 
and respectively. The achieved reduction is measured by comparing and and if reduction 
is deemed sufficient, then x^ + is chosen as the next iterate x^+i- Otherwise the iterate remains 
Xk- The trust-region radius Sk+i is then chosen by either increasing or decreasing Sk according to 
the outcome of the iteration. The details of the algorithm are presented in Algorithm 1. 


Algorithm 1: STOCHASTIC DFO with Random Models 

1 (Initialization): Choose an initial point xq and an initial trust-region radius Sq £ (0,(lmax) 

with (5max > 0. Choose constants 7 > 1, G (0,1), 72 > 0, set /c 0. 

2 (Model construction): Build a model mk{xk + s) = fk + gjs + HkS that approximates 

f{x) on B{xk, Sk) with s = x — Xk- 

3 (Step calculation): Compute Sk = arg min mk{s) (approximately) so that it satisfies 

s:||s||<(5fc 

condition (3). 

4 (Estimates calculation): Obtain estimates and /| of f{xk) and f{xk + Sk)-, respectively. 

jO _ js 

5 (Acceptance of the trial point): Compute pk =- 7 —^^-r. If Pk > 7i and 

mk{xk) - mk[xk + Sk) 

llfl'fcll > mh, set Xfc+i = Xk + Sk; otherwise, set Xk+i = Xk- 

6 (Trust-region radius update): If pk > pi and \\gk\\ > P 2 h, set 4+i = minjydfc, dmax}; 

otherwise Sk+i = 'y~^Sk; k k + 1 and go to step 1. 


The trial step computed on each iteration has to provide sufficient decrease of the model; in 
other words it has to satisfy the following standard fraction of Cauchy decrease condition: 

Assumption 2.1. For every k, the step Sk is computed so that 

mk{xk) -mk{xk + Sk) > ^^||5fc||niin|||^^^,4| (3) 

for some constant Kfcd £ (0,1]. 

If progress is achieved and a new iterate is accepted in the k-th. iteration then we call this 
a successful iteration. Otherwise, the iteration is unsuccessful (and no step is taken). Hence a 
successful iteration occurs when pk > pi and > P 2 dk- However, a successful iteration does 
not necessarily yield an actual reduction in the true function /. This is because the values of f{x) 
are not accessible in our stochastic setting and the step acceptance decision is made merely based 
on the estimates of f{xk) and f{xk + Sk). If these estimates, and /|, are not accurate enough, 
a successful iteration can result in an increase of the true function value. Hence we consider two 
types of successful iterations - those where /(x) is in fact decreased proportionally to 
which we call true successful iterations, and all other successful iterations, where the decrease 
of /(x) can be arbitrarily small or even negative, which we call false successful iterations. Our 
setting and algorithmic framework does not allow us to determine which successful iterations are 
true and which ones are false, however, we will be able to show that true successful iterations 
occur sufficiently often for convergence to hold, if the random estimates and /| are sufficiently 
accurate. 
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A trust region framework based on random models was introduced and analyzed in [2]. In that 
paper, the authors introduced the concept of probabilistically fully-linear models to determine the 
conditions that random models should satisfy for convergence of the algorithm to hold. However, 
the randomness in the models in their setting arises from the the construction process, and not 
from the noisy objective function. It is assumed in [2] that the function values at the current iterate 
and the trial point can be computed exactly and hence all successful iterations are true in that 
case. In our case, it is necessary to define a measure for the accuracy of the estimates and 
(which, as we will see, generally has to be tighter than the measure of accuracy of the model). We 
will use a modified version of the probabilistic estimates introduced in [17]. 

3 Probabilistic Models and Estimates 

The models in this paper are functions which are constructed on each iteration, based on some 
random samples of stochastic function /(x). Hence, the models themselves are random and so is 
their behavior and influence on the iterations. Hence, will denote a random model in the A:-th 
iteration, while we will use the notation ruk = Mk{oj) for its realizations. As a consequence of 
using random models, the iterates A^, the trust-region radii and the steps are also random 
quantities, and so Xk = Xk{u}), 5k = Afc(w), Sk = Sk{u}) will denote their respective realizations. 
Similarly, let random quantities denote the estimates of f{Xk) and f{Xk + Sk), with 

their realizations denoted by /)? = F^{co) and = F^{io). In other words. Algorithm 1 results in 

a stochastic process {Mk, Xk, Sk, Xk, Fj), F^}. Our goal is to show that under certain conditions 

on the sequences {Mk} and the resulting stochastic process has desirable convergence 

properties with probability one. In particular, we will assume that models Mk and estimates 
F^,F^ are sufficiently accurate with sufficiently high probability, conditioned on the past. 

To formalize conditioning on the past, let denote the u-algebra generated by Mq, • • • , Mk-i 

and Fo, • • • , Fk-i and let denote the ci-algebra generated by Mq, • • • , Mk and Fq, • • • , Fk-i- 

To formalize sufficient accuracy, let us recall a measure for the accuracy of deterministic models 
introduced in [8] and [7] (with the exact notation introduced in [4]). 

Definition 3.1. Suppose V/ is Lipschitz continuous. A function nik is a K-fully linear model of 
/ on B(xk, 5k) provided, for k = {Kef, Keg) and Vy G B, 

l|V/(y) - Vmfc(y)|| < Keg5k, and (4) 

\f{y)-mk{y)\ < Kefdl- 

In this paper we extend the following concept of probabilistically fully-linear models which is 
proposed in [2]. 

Definition 3.2. A sequence of random models {Mk} is said to he a-probabilistically K-fully linear 
with respect to the corresponding sequence {B{Xk, Xk)} if the events 

Ik = {Mk is a K-fully linear model of f on B{Xk, Xk)} (5) 

satisfy the condition 

P{h\Fti) > a, 

where is the a-algebra generated by Mq, • • • , Mk-i- 



These probabilistically fully-linear models have the very simple properties that they are fully- 
linear (i.e., accurate enough) with sufficiently high probability, conditioned on the past, and they 
can be arbitrarily inaccurate otherwise. This property is somewhat different from the properties of 
models typical to stochastic optimization (such as, for example, stochastic gradient based models), 
where assumptions on the expected value and the variance of the models is imposed. We will 
discuss this in more detail in Section 5. 

In this paper, aside from sufficiently accurate models, we require estimates of the function values 
/(xfc), /(xfc + Sk) that are sufficiently accurate. This is needed in order to evaluate whether a step 
is successful, unlike the case in [2] where the exact values f{xk) and f{xk + Sfc) are assumed to be 
available. The following definition of accurate estimates is a modified version of that used in [17]. 

Definition 3.3. The estimates /)? and /| are said to be ep-aecurate estimates of f{xk) and f{xk + 
Sk), respectively, for a given Sk if 

\fk - fixk)\ < epdl and \ ff - f{xk + Sfc)| < epSl- (6) 

We now modify Definitions 3.2 and 3.3 and introduce definitions of probabilistically accurate 
models and estimates which we will use throughout the remainder of the paper. 

Definition 3.4. A sequenee of random models {Mk} is said to be a-probabilistically K-fully linear 
with respect to the eorresponding sequenee {B{Xk, if the events 

Ik = {Mk is a K-fully linear model of f on B{Xk, Ak)} (7) 


satisfy the eondition 

PiWtr) > a, 

where is the a-algebra generated by Mq, • • • , Mk-i and Fq, • • • , Fk-i- 

Definition 3.5. A sequence of random estimates is said to be (I-probabilistically ep- 

accurate with respect to the corresponding sequence {Xk, Ak, Sk} if the events 

Jk = {F^,Ff are ep-accurate estimates of f{xk) and f{xkSk), respeetively, for Ak} (8) 

satisfy the eondition 

PiMPtl/2) > /?, 

where ep is a fixed constant and a-algebra generated by Mq, • • • , Mk and Fq, • • • , Fk-i- 

Using Definitions 3.4 and 3.5 we will require in our analysis that our method has access to 
a-probabilistically K-fully linear models, for some fixed k = {Kg f, Keg) and to /3-probabilistically 
CiT’-accurate estimates, for some fixed, sufficiently small ep. Thus, the model and the estimate 
accuracy will be assumed to be proportional to 5^ (with some probability), the condition on the 
estimates is somewhat tighter because of the upper bound on Ci?. However, we will see that this 
upper bound is not very small. 

Procedures for obtaining probabilistically fully-linear models and probabilistically accurate es¬ 
timates under different models of noise are discussed in Section 5. 


9 



4 Convergence Analysis 


We now present first-order convergence analysis for the general framework described in Algorithm 
1. Towards that end, we assume that the function / and its gradient are Lipschitz continuous in 
regions considered by the algorithm realizations. 

Assumption 4.1 (Assumptions on /). Let xq and Jmax be given. Let C{xq) define the set in 
M"" whieh contains all iterates of our algorithm. Assume that f is bounded from below on C{xo). 
Assume also that the function f and its gradient V/ are L-Lipschitz continuous on the set Cenfixo), 
where Cenfixo) defines the region considered by the algorithm realizations 

h^enlixfi) — d3(^X, Sjjiax) ■ 

xGL(xo) 


Remark 4.2. In the case of deterministic functions C{xo) = {x G : f{x) < /(xq)}, because 
algorithm iterates never increase the objective function value, hence they do not step outside the 
initial level set. However, here we allow iterates to increase the function value, because the true 
function value is not known. Such iterates, as we will see, may happen with some (relatively small) 
probability, hence the algorithm can venture outside the initial level set. Hence we choose to make 
the assumption above, which of course depends on the algorithmic behavior. Clearly, if we assume 
a global Lipschitz constant and global lower bound, then the above assumption always holds. If we 
prefer to weaken this assumption, then there are several algorithmic remedies possible, however, 
they will make our analysis more complicated and we choose to leave it for future work. It will be 
evident from our analysis, that with a probability arbitrarily close to 1 we can bound the set C{xo). 

The second assumption provides a uniform upper bound on the model Hessian. 

Assumption 4.3. There exists a positive constant Khhm such that, for every k, the Hessian H^ of 
all realizations of satisfy 

Note that since we are concerned with convergence to a first order stationary point in this paper, 
the bound Kbhm can be chosen to be any nonnegative number, including zero. Allowing a larger 
bound will give more flexibility to the algorithm and may allow better Hessian approximations, but 
as we will see in the convergence analysis, this imposes restrictions on the trust region radius and 
some other algorithmic parameters. 

We now state the following result from martingale literature [11] (see Exercise 5.3.1) that will 
be useful later in our analysis. 

Theorem 4.4. Let Gk be a submartingale, i.e., a sequence of random variables which, for every k, 

E[Gk\Hti\ > Gk-i, 

where = cr(Go, ..., Gk-i) is the a-algebra generated by Gq, ..., Gk-i, and E[Gk\Hjf_fi denotes 
the conditional expectation of Gk given the past history of events H^_i. 

Assume further that Gk — Gk-i < M < oo, for every k. Then, 


P 


lim Gk < oo 

k^oo 


u 


lim sup Gk = oo 

k^oo 


= 1 . 


(9) 
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We now prove some auxiliary lemmas that provide conditions under which decrease of the true 
objective function f{x) is guaranteed. The first lemma states that if the trust region radius is small 
enough relative to the size of the model gradient and if the model is fully linear, then the step Sk 
provides a decrease in f{x) proportional to the size of the model gradient. Note that the trial step 
may still be rejected if the estimates and are not accurate enough. 


Lemma 4.5. Suppose that a model ruk of the form (2) is a {Kef, Keg)-fully linear model of f on 
B{xk,5k). If 


dk < min 


1 


f^fcd 


\\9k\l 


^bhm 8Hef 

then the trial step Sk leads to an improvement in f{xk + Sk) such that 


f{xk + Sk) - f{xk) < -‘^^hkWdk- 


( 10 ) 


Proof. Using the Cauchy decrease condition, the upper bound on model Hessian and the fact that 
Ibfcll > SLbhmdk, we have 


mk{xk) -mk{xk + Sk) > ^^||5(fc||min||^|||,4| = 

Since the model is K-fully linear, one can express the improvement in / achieved by Sk as 


< 

< 


f{xk + Sk) - f{Xk) 


f{xk + Sk) - m{xk + Sk) + m{xk + Sk) 
2KefSl - 


S^fcd 

~T~ 




m{xk) + m{xk) - f{xk) 


where the last inequality is implied by 5k < |^||5fc||- D 

The next lemma shows that for 5k small enough relative to the size of the true gradient 'V f{xk), 
the guaranteed decrease in the objective function, provided by Sk, is proportional to the size of the 
true gradient. 


Lemma 4.6. Under Assumption f.3, suppose that a model is {Kef, Keg)-fully linear on B{xk,5k). 

If 


5k < min 


l|V/(xfc)||, 


Kbhm + Heg ’ ^ + Ke 
^fcd 

then the trial step Sk leads to an improvement in f{xk + Sfc) such that 

f{xk + Sk) - f{xk) < -C'i||V/(xfc)||(5fc, 


where Ci = • max | 


^bhn 


Sk, 




e/ 


( 11 ) 


( 12 ) 
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Proof. The definition of a K-fully-linear model yields that 


> IIW(a;)ll - KegSk- 

Since condition (11) implies that ||V/(xa:)|| > maxi Khhm + i^eg, + Heg [ ^k, we have 

ll^fcll > max I Khhm, 1 4- 
I i^fcd J 

Hence, the conditions of Lemma 4.5 hold and we have 

f{xk + Sk) - fixk) < -^^\\gk\\Sk- 


Since 


> ||V/(x)|| — KegSk in which 4 satisfies (11), we also have 


> max 


^bhm 


SKqJ^ 


^bhm T ^eg T bifcdb^eg 
Combining (13) and (14) yields (12). 


l|V/(xfc)||. 


(13) 

(14) 

□ 


We now prove the lemma that states that, if a) the estimates are sufficiently accurate, b) the 
model is fully-linear and c) the trust-region radius is sufficiently small relatively to the size of the 
model gradient, then a successful step is guaranteed. 

Lemma 4.7. Under Assumption 4-3, suppose that mk is {Kef, Keg)-fully linear on B{xk,Sk) and 
the estimates are ep-accurate with ep < Kef. If 


4 < min 


1 1 Kfed{l-Vl) 


b^bhm h2 


SKef 


hkW, 


(15) 


then the k-th iteration is successful. 
hkW 

yield that 


Proof. Since 6k < , the Cauchy decrease condition and the uniform bound on Hk immediately 

^bhm 


mk{xk) -mk{xk + Sk) > ^^^\\gk\\m.mf^bl]A^Sk\ = ^^^\\gk\\dk■ (16) 

^ L b^bhm J ^ 


The model mk being (^e/, Keg)-fully linear implies that 

\f{xk)-mk{xk)\ < KefSl, and 
\f{Xk +Sk) -mk{Xk-I Sk)\ < Hef^l- 

Since the estimates are CiT’-accurate with ep < Kef, we obtain 

\fk - fiXk)\ < KefSl, and \ ff. - f{Xk + Sfc)| < i^ef^l- 


(17) 

(18) 


(19) 
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We have 


mfc(xfc) - mk{xk + Sk) 

fk - fi^k) _^ fjxk) - rukjxk) mkjxk) - rukjxk + Sk) 

rrikixk) - nikixk + Sk) mk{xk) - mk{xk + Sk) nikixk) - mk{xk + Sk) 
^ mkjxk + Sk) - fjxk + Sk) fjxk + Sk) - fk 

rukixk) - mk{xk + Sk) rnk{xk) - mk{xk + Sk) ’ 


which , combined with (16)-(19), implies 


\pk-l\ < 


8l^efSl 

l^fcd\\9k\\h 


< 1 - m, 


where we have used the assumption 5k < -^^^^j^WdkW to deduce the last inequality. Hence, 
Pk > Pi- Moreover, since ||(/fc|| > p 2 Sk, the /c-th iteration is successful. □ 


Finally, we state and prove the lemma which guarantees an amount of decrease of the objective 
function on a true successful iteration. 


Lemma 4.8. Under Assumption 4-3, suppose that the estimates are ep-accurate with 

niin I l|. If a trial step Sk is accepted (a suceessful iteration oecurs), then the 
improvement in f is bounded below as follows 

f{xk+i) - f{xk) < -C25l, (20) 

where C 2 = 5 ? 7 ir/ 2 K/crf min | l| - 2eF > 0. 

Proof. An iteration being successful indicates that \\gk\\ > P2.5k and p > pi- Thus, 

fk - fk > Viimkixk) - mk{xk + Sk)) 

^ Ffcd,, II . J IlS'fcll r \ 

> hi’?2K/crfniin I 

^ L Fbhm J 

Then, since the estimates are eiT’-accurate, we have that the improvement in / can be bounded as 

f(xk + Sk) - fixk) = f{xk + Sk) - fk + fk- fk+fk- fixk) < -C 25 I, 

where C 2 = 5 r/ir/ 2 K/cd min | l| - 2eF >0. □ 

To prove convergence of Algorithm 1 we will need to assume that models {Mk} and estimates 
{F^,Ff} are sufficiently accurate with sufficiently high probability. 

Assumption 4.9. Given values of a, (I G (0,1) and ep > 0, there exist Keg and Kef such that the 
the sequence of models {Mk} and estimates {F^,Ff} generated by Algorithm 1 are, respectively, 
a-probabilistically {Kef, Keg)- fully-linear and ft-probabilistically ep-accurate. 
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Remark 4.10. Note that this assumption is a statement about the existence of constants k = 
{Kef, Keg) gwcu an a, (3 and ep - we will determine exact conditions on a, f3 and ep in Theorem 
4-11 and Lemma 4-12 below. 

The following theorem states that the trust-region radius converges to zero with probability 1. 

Theorem 4.11. Let Assumptions 4-1 and 4-3 be satisfied and assume that in Algorithm 1 the 
following holds. 


• The step acceptance parameter rj 2 is chosen so that 


m 


> max < Kbhm 


SKef 


Kfcd{l - Vl) 


( 21 ) 


• The accuracy parameter of the estimates satisfies 


ep < 


min 



1 


: Vim fed 


( 22 ) 


Then a and j3 can be chosen so that, if Assumption 4-9 holds for these values, then the sequence 
of trust-region radii, {A^}, generated by Algorithm 1 satisfies 

OO 

(23) 

A:=0 


almost surely. 

Proof. We base our proof on properties of the random function = pf{Xif) -|- (1 — n)A‘^, where 
K £ (0,1) is a fixed constant, which is specified below. A similar function is used in the analysis 
in [17], but the analysis itself is different. The overall goal is to show that there exists a constant 
O' > 0 such that for all k 

E[<^k+i - < -oAl < 0. (24) 

Since / is bounded from below and > 0, we have that is bounded from below for all k 
and hence if (24) holds on every iteration, then by summing (24) over k £ (l,oo) and taking 
expectations on both sides we can conclude that (23) holds with probability 1. Hence, to prove the 
theorem we need to show that (24) holds on each iteration. 

Let us pick some constant f which satisfies 


C A Sleg + max 


/ 1 

l'^^’^/cd(l-^l)r 


(25) 


We now consider two possible cases: ||V/(xfc)|| > and ||V/(xfc)|| < fdk. We will show that (24) 
holds in both cases and hence it holds on every iteration. Given C, we now select v £ (0,1) such 
that 


V 

- > max 

1 — 1 / 


4 - y 2 4^2 ^2 '1 

CCi ’ VlV2Sifed Hef / ’ 


(26) 


with Cl defined as in Lemma 4.6. 
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As usual, let Xk, s^, Qk-, and (/>fc denote realizations of random quantities Xk, A^, 5^, Gk, 
and ^k, respectively. 

Let us consider some realization of Algorithm 1. Note that on all successful iterations, Xk+i = 
Xk + Sk and 4 +i = min{7(5fc, with 7 > 1 , hence 

(pk+i -4>k< iy{f{xk+i) - f{xk)) + (1 - - l)Sl- (27) 

On all unsuccessful iterations, Xk+i = Xk and 5k+i = i-e. 

(/)fc+i - (/)fc = (1 - = 61 < 0 . (28) 

For each iteration and each of the two cases we consider, we will analyze the four possible 
combined outcomes of the events Ik and Jk as defined in (7) and ( 8 ), respectively. 

Before presenting the formal proof let us outline the key ideas. We will show that, unless both 
the model and the estimates are bad on iteration k, we select u G ( 0 , 1 ) sufficiently close to 1 , so 
that the decrease in (pk on a successful iteration is greater than the decrease on an unsuccessful 
iteration (which is equal to 61 , according to (28)). When the model and the estimates are both 
bad, an increase in (pk may occur. This increase is bounded by a value proportional to when 
\\'^f{xk)\\ < C^k- When ||V/(xfc)|| > C5k, though, the increase in cpk may be proportional to 
\\Vf{xk)\\6k- However, since iterations with good models and good estimates will provide decrease 
in (pk also proportional to \\Vf{xk)\\6k, by choosing values of a and pi close enough to 1 , we can 
ensure that in expectation cpk decreases. 

We now present the proof. 


Case 1: ||V/(xfc)|| > C4- 


a. Ik and Jk are both true, i.e., both the model and the estimates are good on iteration k. From 
the definition of p, we know 

II V/(x,)|| > + max 

Then since the model is k- fully linear and, from 772 > Hbhrm < ^e/ and 0 < r/i < 1, it 
is easy to show that the condition (11) in Lemma 4.6 holds. Therefore, the trial step Sk leads 
to a decrease in / as in ( 12 ). 

Moreover, since 


lls-fcll > l|V/(xfc)|| - Kegdk > (C - Heg)Sk > max 



F/cd(l - Vl) 


Sk, 


and the estimates {/fc,/|} are ei?-accurate, with ep < the condition (15) in Lemma 4.7 
holds. Hence, iteration k is successful, i.e. x^+i = Xk + Sk and Sk+i = 'ySk- 


Combining (12) and (27), we get 


(pk+i -(pk< -T^Ci\\Vf{xk)\\dk + (1 - - l)Sl = 62, ( 29 ) 


with Cl defined in Lemma 4.6. Since ||V/(xfc)|| > pSk we have 

62 < [-fCiC + (1 - J^)( 7 ^ - l)]<^fc < 0, 


(30) 


for u G ( 0 , 1 ) satisfying (26). 
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b. Ik is true and is false, i.e., we have a good model and bad estimates on iteration k. 

In this case, Lemma 4.6 still holds, that is Sk yields a sufficient decrease in /, hence, if the 
iteration is successful, we obtain (29) and (30). However, the step can be erroneously rejected, 
because of inaccurate probabilistic estimates, in which case we have an unsuccessful iteration 
and (28) holds. Since (26) holds the right hand side of the first relation in (30) is strictly 
smaller than the right hand side of the first relation in (28) and therefore, (28) holds whether 
the iteration is successful or not. 

c. Ik is false and Jk is true, i.e., we have a bad model and good estimates on iteration k. In this 
case, iteration k can be either successful or unsuccessful. In the unsuccessful case (28) holds. 
When the iteration is successful, since the estimates are CiT’-accurate and (22) holds then by 
Lemma 4.8 (20) holds with C 2 > \r]ir] 2 K.fcd- Hence, in this case we have 

4)k+i - (pk < [- 1^02 + (1 - 1 ^)( 7 ^ - l)]<Jfc- (31) 

Again, due to the choice of i' satisfying (26) we have that, as in case (b), (28) holds whether 
the iteration is successful or not. 

d. Ik and Jk are both false, i.e., both the model and the estimates are bad on iteration k. 

Inaccurate estimates can cause the algorithm to accept a bad step, which may lead to an 
increase both in / and in Sk- Hence in this case 4 >k+i — (t>k niay be positive. However, 
combining the Taylor expansion of f{xk) at Xk + Sk and the Lipschitz continuity of V/(x) we 
can bound the amount of increase in /, hence bounding (pk+i ~ 4>k from above. By adjusting 
the probability of outcome (d) to be sufficiently small, we can ensure that in expectation ^k 
is sufficiently reduced. 

In particular, from Taylor’s Theorem and Lipschitz continuity of Vf{x) we have, respectively, 

f{xk)-f{xk + Sk) > Vf{xk + Sk)'^{-Sk)and 

||V/(xfc + Sfc) - V/(xfc)|| < Lsk<L5k. 

From this we can derive that any increase of f{xk) is bounded by 

f{Xk + Sk) - f{xk) < C'3||V/(xfc)||4, 
where Cs = 1 + |^. Hence, the change in function (p is bounded: 

(pk+i -(pk< f{xk)\\h + (1 - - l)5l = 63 . (32) 


Now we are ready to take the expectation of <I>fc+i — for the case when ||V/(Afc)|| > We 
know that case (a) occurs with a probability at least afl (conditioned on the past) and in that case 
(pk+i — 4’k = b 2 < 0 with 62 defined in (29). Case (d) occurs with probability at most (1 — a)(l — (I) 
and that case 4>k+i — 4>k is bounded from above by 63 > 0. Cases (b) and (c) occur otherwise and 
in those cases (pk+i — 4>k is bounded from above by 61 < 0, with bi defined in (28). Finally we note 
that bi > 62 due to our choice of u. 
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Hence, we can combine (28), (29), (31), and (32), and use Bi, B 2 , and H 3 as random counter¬ 
parts of 61 , 62 , and 63 , to obtain the following bound 

E[^k+i - > cAfc}] 

< o:(3B2 + [<a(l ~ /3) + (1 ~ oi)j3\B\ -|- (1 — q;)(1 — f3)B^ 

= a(3[-uCi\\Vf{Xk)\\Ak + (1 - i^)(7' - 1)A2] 

-|-[a(l — /I) -I- (1 — a)/3](l — — 1)A| 

+(1 - a)(l - f3)[uC3\\Vf{Xk)\\Ak + (1 - - 1)A|]. 

Rearranging the terms we obtain 

E[^k+i - ^k\X^j[,{\NfiXk)\\ > CAfc}] 

< [-uCiap + (1 - a)(l - /3)iyC3]\\Vf{Xk)\\Ak 

+ [a(3 - ^{a{l -/?) + (!- a)f3) + (1 - a)(l - /?)](! - - 1)A| 

T 

< [-Cia/3 + (1 - a)(l - P)C3MXf{Xu)\\Au + (1 - - 1)A|, 


where the last inequality holds because a/3 —— /3) + (1 — a)/3) -|- (1 — a)(l — /3) < [a -|- (1 — 

«)][(/5 + (1 - /?)] = 1 - 

Let us choose 0 < a < 1 and 0 < /3 < 1 so that they satisfy 


(q/^- h) ^ 

(l-a)(l-/3) - Cl 


(33) 


which implies 

[Ciaf3 - (1 - a)(l - /3)C3] > ^Ci > 2 ^^"^^ 

where the last inequality is the result of (26). It is important to note that the quantity ^ in the 
numerator of (33) is chosen so that the first inequality of the above expression holds. This quantity 
can be made arbitrarily small, and with appropriate adjustment to (26) one can have 

[Cia/3 - (1 - a)(l - /3)C3] > 0iCi > 

where 0i is positive and arbitrarily close to zero and 02 > 1 and arbitrarily close to one. We will 
return to this comment in a later remark, but for the sake of simplicity, we choose 0 i = ^ and 
02 = 2 . 

Recall that ||V/(Xfc)|| > CA^, hence 


[_Cia/3 + (1 - a)(l - f5)C3MXf{Xu)\\Au + (1 - - l)Al 

< ^[-Cia/3 + (1 - a)(l - (5)C3MXf{Xk)\\Ak < --^Ciu\\Xf{X,,)\\Ak 

In summary, we have 

E[^k+1 - ^k\Xi!:[,{\\VfiXk)\\ > CAfc}] < -^Ciu\\Vf{Xk)\\Ak (34) 
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and 


{||V/(Xfc)|| > CA,}] <-^{l-u){j^- 1)A| (35) 

For the purposes of this lemma and the liminf-type convergence result, which will follow, bound 
(35) is sufficient. We will use bound (34) in the proof of the lim-type convergence result. 


Case 2: Let us consider now the iterations when ||V/(xfc)|| < (5k- First we note that 
if lls'fcll < ??2<5fc, then we have an unsuccessful step and (28) holds. Hence, we now assume that 
Ibfcll > and again consider four possible outcomes. We will show that in all situations, except 
when both the model and the estimates are bad, (28) holds. In the remaining case, because 
\\^f{xk)\\ < C^k the increase in 4)k can be bounded from above by a multiple of (5^. Hence by 
selecting appropriate values for probabilities a and /3 we will be able to establish the bound on 
expected decrease in 4>fc as in Case 1. 

a. Ik and Jk are both true, i.e., both the model and the estimates are good on iteration k. 

The iteration may or may not be successful, even though Ik is true. On successful iteration 
good model ensures reduction in /. Applying the same argument as in the case 1(c) we 
establish (28). 

b. Ik is true and Jk is false, i.e., we have a good model and bad estimates on iteration k. 

On unsuccessful iterations, (28) holds. On successful iterations, > r] 26 k and r ]2 > i^bhm 
imply that 


mk{xk) - mk{xk + Sk) > 


mm 


\Hk 


,^k 


^ k^fcd j-2 


Since Ik is true, the model is K-fully-linear, and the function decrease can be bounded as 


f{xk) - f{xk + Sk) 

= f{xk) - mk{xk) + mk{xk) - rukixk + Sfc) + rnk{xk + Sk) - f{xk + Sk) 

> 

due to (21). 

It follows that, if k-th. iterate is successful, then 

(pk+i -4>k< + (1 - i^)(7^ - l)]<5i- (36) 


Again by choosing e (0,1) so that (26) holds, we ensure that right hand side of (36) is 
strictly smaller than that of (28), hence (28) holds, whether the iteration is successful or not. 

Remark: r ]2 may need to be a relatively large constant to satisfy (21). This is due to the 
fact that the model has to be sufficiently accurate to ensure decrease in the function if a step 
is taken, since the step is accepted based on poor estimates. Note that r ]2 restricts the size of 
Ak , which is used both as a bound on the step size and the control of the accuracy. In general 
it is possible to have two separate quantities (related by a constant) - one to control the 
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step size and another to control the accuracy. Hence, it is possible to modify our algorithm 
to accept steps larger than ||g'fc||/? 72 - This will make the algorithm more practical, but the 
analysis much more complex. In this paper, we choose to stay with the simplest version, but 
keep in mind that the condition (26) is not terminally restrictive. 

c. Ik is false and Jk is true, i.e., we have a bad model and good estimates on iteration k. 

This case is analyzed identically to the case 1(c). 

d. Ik and Jk are both false, i.e., both the model and the estimates are bad on iteration k. 

Here we bound the maximum possible increase in using the Taylor expansion and the 
Lipschitz continuity of V/(x). 

f{xk + Sk) - f{xk) < ||V/(xfc)||4 + ^L5l < CsCSl- 

Hence, the change in function cp is 

+ (1 - - l)]<^fc- (37) 


We are now ready to bound the expectation of (pk+i — 4>k as we did in Case 1, except that in 
Case 2 we simply combine (37), which holds with probability at most (1 — q;)( 1 — /3) and (28) which 
holds otherwise. 


E[^k+i - ,{||V/(Xfc)|| < CAfc}] 

< [af3 + a{l-f3) + {l-a)f3]{l-u){^-l)Al 
+(1 - a)(l - /3)[uC3C + (1 - J^)(7^ - l)]Ai 

< (1 - + (1 “ “)(1 “ P ) WC 3 C + (1 - z ^)( 7 ^ - 

if we choose probabilities 0 < a < 1 and 0 < /? < 1 so that the following holds. 


(l-a)(l-/3)<^ 


7^-1 

1 + 272C'3C-t^’ 


then 

E[^k+i - {||V/(Xfc)|| < CA,}] < -^(1 - u){^^)Al 


(38) 


(39) 


In conclusion, combining (35) and (39), and noting that 1 — < 7 ^ — 1 we have 


E[^k+i - < - 1(1 - < 0 , 


which implies, that (24) holds with a = — ^(1 — i^)(l — — 1) <0. This concludes the proof of the 

theorem. 

□ 


To summarize the conditions on the probabilities involved in Theorem 4.11 to ensure that the 
theorem holds, we state the following additional lemma. 
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Corollary 4.12. Let all assumptions of Theorem f.ll hold. The statement of Theorem f-H holds 
if the a and (3 are chosen to satisfy the following conditions: 


{ajS - I) 


> 


1 + — 


and 


(1 -a)(l -/3) < 


(l-a)(l-/3)- Cl 
72-1 


with Cl = ■ max | 


^bhn 


^bhm~^^eg ’’ 8Ke/H”^/c(i^ep 


7 ^ — 1 + 72 (3L + 20 • max (, —-—, — I 

' ' / V is; \ CCl ’ »?l)?2K/cd ’ Ke/ J 

I and C = Keg + m- 


(40) 

(41) 


8k., 


e/ 


Proof. The proof follows simply from combining expression for C 3 and condition (26) with (33) 
and (38). 

□ 


Clearly, choosing a and (3 sufficiently close to 1 will satisfy this condition. 


Remark 4.13. We will briefly illustrate through a simple example how these algorithmic parameters 
scale with problem data. 

Recall that L is the Lipschitz constant of the gradient of f and of f over Ceni{x^) ■ R is reasonable 
to expect that Kef and Keg are quantities that scale with L, since Taylor models satisfy this condition, 
as do polynomial interpolation and regression models based on well-poised data sets [8]. Let us 
assume for the sake of an example that Kef = Keg = lOL. The bound on model Hessians Kbhm 
can be chosen to be arbitrarily small, at the expense of limiting the class of models, however, it 
is clearly reasonable to choose it as something that scales with L, if this information is available. 
Let us assume that Kbhm = WL, as well. In a standard trust region method, a common choice of 
algorithmic parameters would use Kfcd = 5? 7 = 2, and 71 = ^. 

The reader can easily verify that with these parameter choices and previous assumptions. Lemma 
4.12 states that we must choose 72 > 32L. The intermediate constants satisfy f > 42L and Ci = ^. 
Without loss of generality, we will simply accept C, = 42L. 

From observing that, given the above values of the constants. 


we have 


and 


max 


CCi'm'n2Kfcd Kef j L 


<T, 


{afl - \) 

(1 - a)(l - (3) 


> 9 


(l-a)(l-/3) < 


440 


(42) 

(43) 


We note that as 71 and Kfcd constants are driven closer to 1, the constant 440 can be reduced by 
up to a factor of 4. 

Supposing that Kef, Keg, and Kbhm scale linearly with L, then 72, ep, and the expressions relating 
to a and (3 in Corollary 4-12 are all functions in L satisfying 


72 > &{L), 


(44) 
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ep < 0(L), 

(45) 

( i-X 

(46) 

(l-a)(l-/3)<0(l). 

(47) 


where we use the notation ©(•) to indicate the O(-) relationship with moderate constants. 


Remark 4.14. Recall our remark made earlier in Theorem case 2b, on how r ]2 bounds our 

step sizes. Indeed, if r ]2 > QiL) has to be imposed this may force the algorithm to take small step 
sizes throughout. However, as mentioned earlier, the analysis of Theorem 4-11 can be modified 
by introducing a tradeoff between the size of r }2 and the accuracy parameters ep and (as both 
of these constant parameters can be made smaller). It also may he advantageous to choose r ]2 
dynamically. Exploring this is a subject for future work. In the practical implementations that we 
will discuss in Section 6, we do not make use of the algorithmic parameter t ]2 at all, and so even 
though r ]2 is effectively arbitrarily close to 0, the algorithm still works. 


Remark 4.15. Note that if fl = 0, then —)■ 0 for any positive value of a, which is the case 

shown in [2], since, as we pointed earlier, condition (33) can be restated with ^ replaced by an 
arbitrary small positive value. 


4.1 The liminf-type convergence 

We are ready to prove a liminf-type first-order convergence result, i.e., that a subsequence of the 
iterates drive the gradient of the objective function to zero. The proof follows closely that in [2], 
the key difference being the assumption on the function estimates that are needed to ensure that 
a good step gets accepted by Algorithm 1. 

Theorem 4.16. Let the assumptions of Theorem f-H and Lemma 4-12 hold. Suppose additionally 
that a/3 > Then the sequence of random iterates generated by Algorithm 1, {X^}, almost surely 
satisfies 

liminf||V/(Afc)|| = 0. 

k^oo 

Proof. We prove this result by contradiction conditioned on the almost sure event —)■ 0. Let us 

thus assume that there exists e' such that, with positive probability, we have 

||V/(Afc)|| >6', Vfe. 


Let {xfc} and {5^} be realizations of {X^} and {A^}, respectively for which ||V/(xfc)|| > e', V/c. 
Since lim <3^ = 0 (because we conditioned on A^ —)■ 0), there exists ko such that for all k > ko, 

k^oo 


Sk < b := min 


^^bhn 


^/cd(i - my 

WKef 


2?72 7 


(48) 


We define a random variable Rk with realizations = log..^ Then for the realization {r^} 

of {Rk}, rk < 0 for k > ko. The main idea of the proof is to show that such realizations occur only 
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with probability zero, hence obtaining a contradiction with the initial assumption of ||V/(xfc)|| > ^ 

yk. 

We first show that Rk is a submartingale. Recall the events R and Jk in Definitions 3.4 and 
3.5. Consider some iterate k > ko for which R and Jk both occur, which happens with probability 
P{Rr\ Jk) > a/3. Since (48) holds we have exactly the same situation as in Case 1 (a) in the proof 
of Theorem 4.11. In other words, we can apply Lemmas 4.6 and 4.7 to conclude that the k-th 
iteration is successful, hence, the trust-region radius is increased. In particular, since 5k < 

5k+i = 74- Consequently, rk+i = + 1. 

Let Rk^i = o'(/o, • • • , R-i) n cr{Jo, • • • , Jk-i)- For all other outcomes of R and Jk, which occur 
with total probability of at most 1 — a/3, we have < 5^+1 > ^~^5k- Hence 

E[rfc+i|T'fcl^J > a/3(rfc + 1) + (1 - a^)(rfc - 1) > Vk, 

as long as a/3 > 1/2, which implies that Rk is a submartingale. 

Now let us construct another submartingale Wk, on the same probability space as Rk which 

will serve as a lower bound on Rk and for which < lim sup Wk = oo > holds almost surely. Define 

L k^oo J 

indicator random variables I 4 and Ijj, such that = 1 if R occurs, I 4 = 0 otherwise, and 
similarly, Ij^ = 1 if occurs, Ij^ = 0 otherwise. Then define 

k 

Wk = '^{2 ■ I 4 • I 4 - 1 ). 

i =0 


Notice that Wk is a submartingale since 


nWk\H-i] = E[Wk-i\J^t\]+E[2-li^-lj^-l\Ri-_\] 

= Wk-i + 2E[li^^ ■ - 1 

= Wk-i + 2P{RnJk\Ei-^,)-l 




where the last inequality holds because a/3 > 1/2. Since Wk only has ±1 increments, it has no 

finite limit. Therefore, by Theorem 4.4, we have < lim sup Wk = 00 

L k^oo 

By the construction of Rk and Wk, we know that Vk — > Wk — Wk^- Therefore, Rk has to be 

positive infinitely often with probability one. This implies that the sequence of realizations such 
that rfc < 0 for A: > ko occurs with probability zero. Therefore our assumption that \\Vf{Xk)\\ > e' 
hold for all k with positive probability is false and 

liminf||V/(Xfc)|| = 0 

k^oo 


holds almost surely. 

□ 


4.2 The lim-type convergence 

In this subsection we show that limfe^oo ||V/(Xfc)|| = 0 almost surely. 
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We now state an auxiliary lemma, which is similar to the one in [2], but requires a different proof 
because in our case the function values f{Xk) can increase with k, while in the case considered in 
[2], function values are monotonically nonincreasing. 

Lemma 4.17. Let the same assumptions that were made in Theorem 4-16 hold. Let {Xk} and 
{Afc} be sequences of random iterates and random trust-region radii generated by Algorithm 1. Fix 
e > 0 and define the sequence {Kfi\ consisting of the natural numbers k for which \\Vf{Xk)\\ > e 
(note that is a sequence of random variables). Then, 

^ Afc < oo 

ke{K,} 


almost surely. 

Proof. From Theorem 4.11 we know that ^ A^ < oo and hence A^ —)■ 0 almost surely. For each 
realization of Algorithm 1 and a sequence {<5^}, there exists ko such that 5^ < e/C, V/c > /cq, where C 
is defined as in Theorem 4.11. Let Kq be the random variable with realization ko and let K denote 
the sequence of indices k such that k G K,, and k > Kq. Then for all k G K, Case 1 of Theorem 
4.11 holds, i.e., ||V/(Afc)|| > (A^, since ||V/(Afc)|| > e for all k G K. From this and from (34) we 
have 

E[<^k+i - < -\ciueAk, yk > ko. 

Recall that is bounded from below. Hence, summing up the above inequality for all k G K 
and taking the expectation, we have that 


Afc < oo 

k&K 

almost surely. Since K^: C K L) {k < Kq} and Kq is finite almost surely then the statement of the 
lemma holds. □ 

We are now ready to state the lim-type result. 

Theorem 4.18. Let the same assumptions as in Theorem 4-16 hold. Let {A^} be a sequence of 
random iterates generated by Algorithm 1. Then, almost surely, 

hm ||V/(Afc)|| = 0. 

k^QO 

Proof. The proof of this result, is almost identical to the proof of the same theorem in [2] hence we 
will not present the proof here. The key idea of the proof is to show that if the theorem does not 
hold, then with positive probability 

^ Afc = oo, 
ke{K,} 

with ATe defined as in Lemma 4.17. This result is shown using Lipschitz continuity of the gradient 
and does not depend on the stochastic nature of the algorithm. Since this result contradicts the 
almost sure result of Lemma 4.17, we can conclude that the statement of the theorem holds almost 
surely. 
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5 Constructing models and estimates in different stochastic set¬ 
tings. 

We now discuss various settings of stochastic noise in the objective function and how a-probabilistically 
K-fully linear models and /^-probabilistically CiT’-accurate estimates can be obtained in these settings. 

Recall that we assume that for each x we can compute the value of /(x), which is the noisy 
version of /, 

f{x) = f{x,uj), 

where w is a random variable which induces the noise. 

Simple stochastic noise. The example of noise which is typically considered in stochastic op¬ 
timization is the “i.i.d.” noise, that is noise with distribution independent of the function values. 
Here we consider a somewhat more general setting, where the noise is unbiased for all /, i.e., 

E^[f{x,u!)] = f{x), \fx, 

and 

Var^[/(x,w)] <V< oo, Vx. 

This is the typical noise assumption in stochastic optimization literature. In the case of unbiased 
noise as above, constructing estimates and models that satisfy our assumptions is fairly straight¬ 
forward. First, let us consider the case when only the noisy function values are available (without 
any gradient information), where we want to construct a model that is k- fully linear in a given 
trust region B{x^,d) with some reasonably large probability, a. 

One can employ standard sample averaging approximation techniques to reduce the variance 
of the function evaluations. In particular, let fp{x,uj) = where Wj are the i.i.d. 

realizations of the noise u. Then, by Chebyshev inequality, for any u > 0, 

P(|/p(x,w) - /(x)]| >v) = P{\fp{x,U!) - E^[f{x, 0 j)]\ >v)< 

In particular, we want v = for some > 0 and < 1 — a' for some a', which can be 

ensured by choosing p > • 

We now construct a fully linear model as follows: given a well-poised set^ Y of n -|- I points in 
B{x^,5), at each point y* G Y, we compute fp{y'^,io) and build a linear interpolation model m(x) 
such that m{y^) = fp{y^, cu), for all i = 1,..., n -|- 1. Hence, for any y* G Y, we have 

P(|m(y*) - /(y*)]| > < 1 - a • 

Moreover, the events {|m(y*) — /(y*)]| > are independent, hence 

P( max {|m(y*) - /(y*)]|} > < 1 - 

It is easy to show using, for example, techniques described in [8], that m(x) is a K-fully linear 
model of Ei^[f{x,uj)] in B{x^,6) for appropriately chosen k = {Keg, Kef), with probability at least 
a = {aT^\ 

^See [8] for details on well-poised sets and how they can be obtained. 
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Computing the /3-probabilistically eir-accurate estimates of f{x,uj) can be done analogously to 
the construction of the models described above. 

The majority of stochastic optimization and sample average approximation methods focus on 
derivative based optimization where it is assumed that, in addition to f{x,uj), Vxf{x,uj) is also 
available, and that the noise in the gradient computation is also independent of x, that is 

E^[Vxf{x,io)]=Vf{x), Vx 


and 

Var^[||Va;/(x,a;)||] < C < oo, Vx, 

(in general the variance of the gradient and the function value are not the same, but here for 
simplicity we bound both hy V). 

In the case when the noisy gradient values are available, the construction of fully linear models 
in B{x^,5) is simpler. Let Vfp{x,uj) = ^ V/(x, Wj). Again, by extension of Chebychev 

inequality, for p such that 


p > max{ 


V V 




P(||V/p(xO,U;) - V/(x°)]|| > KegS) = P{\\Vfp{x^,u)-E^[Vf{x^,U,)]\\ > KegS) <!-«', 

and 

P(|/p(x°,w) - /(x°)| > Kef 5^) = P{\fp{x^,aj) - Eu,[f{x^,u)]\ > Kef 6^) <!-«'. 

Hence the linear expansion m(x) = fp{x^,uj) + V/p(x°, a;)^(x — x°) is a K-fully linear model of 
/(x) = Ei^[f{x,uj)] on B{x^,5) for appropriately chosen k = {Keg, Kef), with probability at least 
a = (a')^. 

In [17] it is shown that least squares regression models based on sufficiently large strongly poised 
[8] sample sets are a-probabilistically K-fully linear models. 

There are many existing methods and convergence results using sample average approximations 
and stochastic gradients for stochastic optimization with i.i.d. or unbiased noise. Some of these 
methods have been shown to achieve optimal sampling rate [14], that is they converge to the optimal 
solution while sampling the gradient at the best possible rate. We do not provide convergence rates 
in this paper (it is a subject for future research), hence it remains to be seen if our algorithm can 
achieve the optimal rate. Our contribution here is the method which applies beyond the i.i.d. case, 
as we will discuss below. In Section 6, however, we demonstrate that our method can have superior 
numerical behavior compared to standard sample averaging even in the case of the i.i.d. noise, so 
it is at least competitive in practice. 


Function computation failures. Now, let us consider a more complex noise case. Suppose that 
the function f(x) is computed (as it often is, in black-box optimization) by some numerical method 
that includes a random component. Such examples are common in machine learning applications, 
for instance, where x is the vector of hyperparameters of a learning algorithm and /(x) is the 
expected error of the resulting classifier. In this case, for each value of x, the classifier is obtained 
by solving an optimization problem, up to a certain accuracy, on a given training set. Then the 
classifier is evaluated on the testing set. If a randomized coordinate descent or a stochastic gradient 
descent is applied to train the classifier for a given vector x, then the resulting classifier is sufficiently 
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close to the optimal classifier with some known probability. However, in the case when the training 
of the classifier fails to produce a sufficiently accurate solution, the resulting error is difficult to 
estimate. Usually, it is possible to know the upper bound on the value of this inaccurate objective 
function but nothing else may be known about the distribution of this value. Moreover, it is likely 
that the probability of the accurate computation depends on x; for example, after k iterations of 
randomized coordinate descent, the error between the true f{x) and the computed f{x) is bounded 
by some value, with probability a, where the value depends on a, /c, and x [22]. 

Another example is solving a system of nonlinear black-box equations. Assume that we seek x 
such that — C*) for some functions fi{x), i = 1,... ,m that are computed by numerical 

simulation, with noise. As is often done in practice (and is supported by our theory) the noise in 
the function computation is reduced as the algorithm progresses, for example, by reducing the size 
of a discretization, step size, or convergence tolerance within the black-box computation. These 
adjustments for noise reduction usually increase the workload of the simulation. With the increase 
of the workload, there is an increased probability of failure of the code. Hence, the smaller the 
values of fi{x), the more likely the computation of fi{x) will fail and some inaccurate value is 
returned. 

These two examples show that the noise in f{x) may be large with some positive probability, 
which may depend on x. Hence, let us consider the following, idealized, noise model 


fix) = fix,uj) 


fix), w.p. l-o-(x) 
w(x) < V w.p. ct(x), 


where cj(x) is the probability with which the function f{x) is computed inaccurately, and uj{x) is 
some random function of x, for which only an upper bound V is known. This case is idealized, 
because we assume that with probability 1 — (t(x), /(x) is computed exactly. It is trivial to extend 
this example to the case when /(x) is computed with an error, but this error can be made sufficiently 
small. 

For this model of function computation failures we have 


E^[f{x,u:)] = (1 - cj(x))/(x) -h cj(x)F;[w(x)] ^ /(x), Vcr(x) > 0. 

and it is clear, that for any (t(x) > 0, unless E[uj{x)] = some constant, optimizing Ei^[f{x,u})] 
does not give the same result as optimizing /(x). Hence applying Monte-Carlo sampling within an 
optimization algorithm solving this problem is not a correct approach. 

We now observe that constructing a-probabilistically K-fully linear models and /3-probabilistically 
CiT’-accurate estimates is trivial in this case, assuming that cj(x) < a for all x, when a is small enough. 
In particular, given a trust region B{x^, S), sampling a function /(x) on a sample set Y C H(x°, 5) 
well-poised for linear interpolation will produce a k- fully linear model in B{x^,5) with probabil¬ 
ity at least (1 — since with this probability all of the function values are computed exactly. 

Similarly, for any s G H(x®, 5), the function estimates and T® are both correct with probability 
at least (1 — ct )^. Assuming that (1 — > a and (1 — > /3, where a and j3 satisfy the 

assumptions of Theorem 4.11 and Lemma 4.12 and afi > ^ as in Theorem 4.16, we observe that 
the resulting models satisfy our theory. 

Remark 5.1. We assume here that the probability of failure to eompute fix) is small enough for 
all X. In the machine learning example above, it is often possible to control the probability cj(x) 
in the computation of fix), for example by increasing the number of iterations of a randomized 
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coordinate descent or stochastic gradient method. In the case of the black-box nonlinear equation 
solver, the probability of code failure is expected to be quite small. There are, however, examples of 
black box optimization problems where the computation of f{x) fails all the time for specific values 
of x. This is often referred to as hidden constraints [19]. Clearly our theory does not apply here, 
but we believe there is no local method that can provably converge to a local minimizer in such a 
setting without additional information about these specific values of x. 

6 Computational Experiments 

In this section, we will discuss the performance of several variants of our proposed method (varied 
in the way the models are constructed), henceforth only referred to as STORM (STochastic Op¬ 
timization using Random Models), that target various noisy situations discussed in the previous 
section. We note that a comparison of STORM to the SPSA method of [26] and the classical 
Kiefer-Wolfowitz method in [16] has been reported in [6] and shows that STORM significantly 
outperformed these two methods, while no special tuning of SPSA or Kiefer-Wolfowitz was ap¬ 
plied. Since a trust-region based method, which is able to use second order information is likely to 
outperform stochastic gradient-like methods in many settings, we omit such comparison here. 

Throughout this section, all proposed algorithms were implemented in Matlab and all exper¬ 
iments were performed on a laptop computer running Ubuntu 14.04 LTS with an Intel Celeron 
2955U @ 1.40GHz dual processor. 

6.1 Simple stochastic noise 

In these experiments, we used a set of 53 unconstrained problems adapted from the CUTEr test 
set, each being in the form of a sum of squares problem, i.e. 

m 

fix) = ^(/i(x))^ (49) 

i=l 

where for each i G {1,..., m}, fi{x) is a smooth function. Two different types of noise will be 
used in this first subsection, which we will refer to as multiplicative noise and additive noise. In the 
multiplicative noise case, for each i G {1,..., m}, we generate some cu* from the uniform distribution 
on [—O', fj] for some parameter cj > 0, and then compute the noisy function 

m 

f{x,U!) = ^{{l+UJi)fi{x))‘^. (50) 

i=l 

The key characteristic of this noise is that for each x, we have £'t^[/(x,w)] = f{x), however the 
variance is nonconstant over x and scales with the magnitudes of the components fi{x). Thus, one 
should expect that if an algorithm is minimizing a function of the form (50), the accuracy of the 
estimates of f{x) based on a constant number of samples of f{x,uj) should increase, assuming that 
the algorithm produces a decreasing sequence {f{x^)}^=i- While this behavior is not supported by 
theory, because we do not know how quickly f{x^) decreases, our computational results show that 
a constant number of samples is indeed sufficient for convergence. 
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The other type of noise we will test is additive, i.e. we additively perturb each component in 
(49) by some uii uniformly generated in [—(t, cr] for some parameter ct > 0. That is, 


m 

fix, = ^ifiix) + uJif (51) 

i=l 

Note that the noise is additive only in terms of the component functions, but not in terms of the 
objective function, moreover Ei^[f{x,uj)] = f{x) + YIT However, the constant bias term 

does not affect optimization results, since minx E^[f (x, tv)] = niinj, fix). 

In our first set of experiments for these two noisy settings, we compare a version of STORM 
to a version of sample average based trust region algorithms, which we will call “TR-SAA”, and 
which is similar to a trust-region algorithm presented in [9]. Similar method, with convergence 
guarantees, has been recently proposed in [25]. In their work, they use a Bayesian scheme to 
select a sufficiently large sample complexity for computing average function values at a current 
interpolation set. Here, in TR-SAA, we simplify this approach, by increasing sample complexity in 
each iteration proportionally to the decrease of the trust region radius 6^. A description of TR-SAA 
is given in Algorithm 2 in the Appendix. 

There are two particular aspects of TR-SAA that we would like to draw attention to; in the 
estimate calculation step, the computation of is performed before the model ruk is constructed, 
and rrifc is built to interpolate /^, hence the quality of estimate and that of the model are 
dependent. Additionally, the quality of the model ruk is dependent on that of rrik-i because the 
samples are reused. Both of these aspects are violations of the typical assumptions of STORM. 
Thus, we also propose TR-SAA-resample, which is the same algorithm as TR-SAA except that 
at each iteration, the function value at every interpolation point is recomputed as an average 
of function evaluations, independent of past function evaluations. While TR-SAA-resample may 
overcome some of the problems of the dependence of nik on nik-i, it still doesn’t satisfy the 
assumptions of STORM because of the dependence of on ink- 

Thus, in Algorithm 3, stated in the Appendix, we propose a version of STORM, comparable 
to TR-SAA in terms of sample sizes. In Algorithm 3, the models nik and nik-i are entirely 
independent since a new regression set is drawn in each iteration. Additionally, in the estimates 
calculation step, the computations of and /| are completely independent of the model m^. For 
these reasons. Algorithm 3 is more in line with the theory analyzed in this paper than a sample 
average approximation scheme like in Algorithm 2. 

For each of the 53 problems, the best known value of the noiseless f{x) obtained by a solver is 
recorded as f*. We recorded the number of function evaluations required by a solver to obtain a 
function value /(x^) < /' such that 


I — r < 


fjx^) - f 

/(xO) - /*■ 


(52) 


This number was averaged over 10 runs for each problem. In the profiles shown in Figure (6.1) 
for the multiplicative noise case r = 10“^. In all the experiments, a budget of I000(n -|- I) noisy 
function evaluations was set. For the choice of initialization, the same parameters were used in all 
of TR-SAA, TR-SAA-resample, and STORM-unbiased: (5max = 10, 5o = 1,7 = 2,?7i = O.I ,?72 = 

0.001, Pmin = 10. 

Note that even though we have ignored the theoretical prescription derived in the previous 
section that sample rate should scale with 1/(5^, we note that STORM-unbiased performs extremely 
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Figure 1: Performance profiles (r = 10 comparing STORM-unbiased, TR-SAA, and TR-SAA- 
resample on the testset. 
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well compared to the TR-SAA method. Although we chose to sample at a rate so that pk was on the 
order of l/6k, this particular sample rate was chosen after testing various other rates on the same 
set of test functions, and seemed to work relatively well for both STORM-unbiased and TR-SAA. 

Function computation failures. In these experiments, we used the same 53 sum of squares 
problems as in the unbiased noise experiments described above, but introduced biased noise. For 
each component in the sum in (49), if \fi{x)\ < e for some parameter e > 0, then fi{x) is computed 
as 


fi{x) 


fi{x) w.p. 1 - cr 
V w.p. a 


for some parameter cj > 0 and for some “garbage value” V. If fi{x) > e, then it is deterministically 
computed as fi{x). This noise is biased, with bias depending on x, and we should not expect any 
sort of averaging approximation to work well here. This is indeed indicated in our experiments, 
where various levels of a and e are shown below. There choice of V did not significantly affect the 
results, and in the experiments illustrated below V = —10000 was used. Obviously, the intention 
here is that such a large negative value will cause STORM to see a trial step as promising, when 
it may, in fact, yield an increase in function value if taken. We propose the version of STORM 
presented as Algorithm 4 in the appendix. 

The key feature of Algorithm 4 is that on each iteration, the interpolation set changes minimally 
as in a typical DFO trust region method, but the interpolated function values are computed afresh. 
Intuitively, this is the right thing to do, since if a “garbage value” is computed at some point in 
the algorithm to either construct a model or provide a function value estimate, we do not want its 
presence to affect the computation of models in subsequent iterations. No averaging is performed, 
as it can only cause harm in the setting. 

Since we are not aware of any other optimization algorithm that is designed for this case of noise, 
we performed no comparisons, but experiment to discover how the method works as a function of 
the probability of failure. On the test set of 53 problems, we ran Algorithm 4 30 times and report 
the average percentage of instances that are solved in the sense of (52) with r = 10“^ within a 
budget of 10000(n -|- 1) function evaluations, where f* was computed by Algorithm 4 with a = 0. 
In order to standardize the probability of failure over the test set, we define the probability of 
success ps and then take o" on a function with m component to be cr = 1 — The results are 

summarized in Figure 6.1. 

This experiment suggests that the practical threshold at which STORM fails to make progress 
may be looser than that suggested by theory. We will illustrate this idea through a simple example. 
Consider the minimization of the simple quadratic function 


n 


fix) = 2 ^iXi - 1)^ 

i=l 


(53) 


The minimizer uniquely occurs at the vector of all Is. Now consider the minimization of this 
function under our setting of computation failure where we vary the probability parameter a and 
fix e = 0.1. Suppose on each iteration of STORM, the interpolation set contains (n -|- l)(n -|- 2)/2 
points. Then, the probability of obtaining the correct quadratic model in the worst case where for 


all 


\xi - 1 | 


< e IS precisely a = ((1 — ay ) 2 


Likewise, the probability of obtaining the 
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Figure 2: Average percentage of problems solved as a function of probability of success ps- 



Probability of success, p 


correct function evaluation for Fq and Fg on each iteration in the worst case is /3 = ((1 — cr)”)^. 
Now, supposing we initialize STORM with the zero vector in M”, it is reasonable to assume that 
all iterates will occur near the unit cube [ 0 , 1 ]”, and so we can use simple calculus to estimate a 
Lipschitz constant of the function over the relevant domain as 2^/n, and the Lipschitz constant of 
the gradient is constantly 2. These also serve as reasonable estimates of Kgf and Keg, respectively. 
Using parameter choices 7 = 2, 771 = 0.1, 72 = 1, we can use the bounds in (40) and (41) to solve 
for the smallest allowable (1 — a) for which our algorithm can guarantee convergence. For n = 2, 
our theory suggests that we can safely lower bound (1 — u) > 0.9592 (which implies a ~ 0.6069 
and (3 ^ 0.8467), and for n = 10, we can lower bound (1 — u) > 0.9990 (which implies a ps 0.5233 
and /? pa 0.9806). 

In Figure 6.1 for n = 2,10, we plot an indicated level of (1 — a) on the x-axis against the 
proportion of 100 randomly seeded instances with that level of (1 — a) that Algorithm 4 managed 
to find a solution x* satisfying f{x*) < 10 “^ within 10 “^ many function evaluations using the dis¬ 
cussed parameter choices. The red line shows the level of (1 — a) that our theory predicted in the 
previous paragraph. As we can see, (1 — a) can be quite smaller than predicted by our theory 
before the failure rate becomes unsatisfactory. As a particular example, in the n = 10 case, when 
(1 —fj) = .998, the corresponding probabilities are a k. 0.266782 and /3 ps 0.960751, and yet 100% of 
the instances were solved to the required level of accuracy. In other words, even though the models 
are eventually only accurate on roughly 27% of the iterations, we still see satisfactory performance. 


6.2 Stochastic gradient based method comparison 

In this subsection, we show how STORM applies to empirical risk minimization in machine learning. 
Consider a training dataset of N samples, {(xj, yi)}j=i,,.Ar, where Xj G is a vector of m real¬ 
valued and Hi G {—1,1} indicates a positive or negative label respectively. We will train a linear 
classifier and bias term {w, (3) G by minimizing the smooth (convex) regularized logistic loss 

N N 

f{w, /3) = ^ ^ fi{w, /3) + AlIrriP = ^ X] + exp{-yi{w'^Xi + 13))) + A||u;|p. 

i=l i=l 

As in the typical machine learning setting, we will assume that N » m and computing f{w, (3) 
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Figure 3: Minimizing a simple quadratic function (dimension n = 2 in the left, n = 10 in the right) 
where with probability 1 — a, a coordinate is computed incorrectly. 


as well as V and V‘^f{w,/3) is prohibitive. Hence we will only compute estimates of these 
quantities by considering a sample / C {1,..., N} of size |/| = n << N, yielding 

fi{w, /3) = t4 X] ^ X] 

' ' iei II iei 

= ^Y.V^f,{w,P) + 2XIn. 

I I is/ 

Thus we can construct models based on sample gradient and Hessians information and we present 
the appropriate variant of STORM as Algorithm 5 in the Appendix. 

We compare Algorithm 5 with the well-known implementation of Adagrad from the Ada- 
whatever package described in [10]. We compare against this particular solver because it is a 
well-understood stochastic gradient method used by the machine learning community that, like our 
algorithm, takes adaptive step sizes, but unlike our algorithm, does not compute estimates of the 
loss function, but only computes averaged stochastic gradients. For the choice of initialization in 
Algorithm 5, the following parameters were used: (5max = 10, (5o = 1, xq = 0 ,7 = 2, r/i = 0 . 1,72 = 
0.001, pmin = rn + 2,pmax = A^- Adagrad was also given the same initial point and an initial step 
size of (5o = 1. 

We implemented two versions of Algorithm 5: one which uses stochastic Hessians, and a second 
where we do not compute stochastic Hessians, effectively setting Hk = 0 on each iteration, yielding 
a trivial subproblem in the step calculation. 

For each of the datasets, we randomly partition N into a training set of size [0.95 * A^J and 
a testing set of size [0.05 * A^]. We set the maximum budget of data evaluations for each solver 
equal to the size of the training set, thus comparing various solvers’ performance with a budget 
of roughly one full pass through the dataset. In the two implementations of STORM, we plot in 
4 the true training loss function value at the end of each successful iteration, while for Adagrad, 
we simply plot the true training loss function value over an evenly spaced array of function value 
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Figure 4: Trajectory of training logistic loss on four datasets. 
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Figure 5: Trajectory of testing logistic loss on four datasets. 
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counts. Likewise in 5, we plot at the same points the value of the holdout testing loss function 
value. 

Notice that, as expected, the true function values produced by Adagrad can vary widely over 
this horizon, but implementations of STORM tend to yield fairly stable decreasing trajectories over 
its successful iterations. Also, we see that the loss seems to generalize fairly well to the holdout 
test data. 
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Algorithm 2: TR-SAA 

1 (Initialization): Choose an initial point xq and an initial trust-region radius 6o £ (0,5max) 

with (5max > 0. Choose constants 7 > 1, G (0,1), 72 £ (0;Co), Pmax = {n + l)(n -|- 2)/2, 
Pmin > n + 1. Set k 0. Select some initial interpolation set Yq C B{xk, 6k) so that 
I^01 < Pmax and xq £ Yq. Compute an averaged function value estimate 
^ at each y £ Yq. 

2 while true do 

3 (Update sample rate): Set sample rate pk = maxjpmin + k,l/6}. 

4 (Update interpolation value estimates): For each y G 14 n Yfc_i, compute 

f{y) = ^[Er=pfe_i-hi hy^^i) + f{y)] and for each y £Yk\ Ufc-i, compute 

fiy) = ^Ef=i/(y>a;i) . 

5 If the algorithm is TR-SAA-resample, for each y £Yk, compute 

fiy) = ^Ef=i/(y>a;i). 

6 (Model building): Build a quadratic model mk{xk + s) = fk + gjs + Hk-s with 

s = X — Xk that interpolates f{y) at the points of Yk. 

7 (Step calculation): Compute Sk = arg min mk{s) (approximately) so that Sk 

s:\\s\\<Sk 

satisfies (3). 

8 (Estimate calculation): Compute new estimate fk~^ fi^k + Sk, 0 Ji) of 

f{xk -|- Sk). Denote the current estimate of f{xk) by /°. 

jO _ ys 

9 (Acceptance of the trial point): Compute pk = -7 —f -r. 

mk{xk) - mk[xk Sk) 

10 If Pk > Pi and \\gk\\ > 724, then x^+i ^ Xk + Sk; otherwise, Xk+i ^ Xk- 

11 (Trust-region radius update): If pk > 71 , 4-i-i iiifo{74,'^max}; otherwise 

4-1-1 7 ^4- 

12 (Interpolation set update): Augment Yk with Xk + Sk- If |14| > Pmax, remove the 

point of Yk furthest from Xk+i- 

13 (Iterate): k k + 1. 

14 end 





Algorithm 3: STORM for unbiased noise 

1 (Initialization): Choose an initial point xq and an initial trust-region radius 6o G (0,5max) 

with (5max > 0. Choose constants 7 > 1, G (0, 1), 72 G (0,oo), Pmin; Set k 0. Select a 
regression set Yq C B{xk,6k) satisfying IIqI = Pmin- 

2 while true do 

3 (Update sample rate): Choose a sample rate pk = rnaxlpmin -|- k, 1/(5}. 

4 (Regression set update): Uniformly sample a regression set Yk C B{xk,Sk) satisfying 

|lfc| =Pk- 

5 (Compute new regression value estimates): For each y £Yk, compute a single 

estimate f{y, a;). 

6 (Model building): Build a quadratic model mk{xk + s) = fk + gjs + Hk-s with 

s = X — Xk that regresses f{y) at the points of Y/. 

7 (Step calculation): Compute Sk = arg min mk{s) (approximately) so that Sk 

s:||s||<(5fc 

satisfies (3). 

8 (Estimates calculation): Compute new estimates = X]?=i f{xk, 0 Ji) and 

fk = Ylti f{xk + Sk,uji) of f{xk) and f{xk + Sfc). 

9 (Acceptance of the trial point): Compute pk = -7— ^^ -r. 

mk[Xk) - mk[xk + Sk) 

10 If Pk > Pi and \\gk\\ > 724, then x^+i ^ Xk + Sk; otherwise, Xk+i Xk- 

11 (Trust-region radius update): If pk > 71 , 4-i-i iiiiii{74,'5max}; otherwise 

4-1-1 7 ^4- 

12 (Iterate): /c /c + 1. 
end 
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Algorithm 4: STORM for unbiased noise 

1 (Initialization): Choose an initial point xq and an initial trust-region radius 6o G (0,5max) 

with (5max > 0. Choose constants 7 > 1, G (0,1), 72 £ (0, 00 ), Po ^ n + 1 , 

Pmax = {n + l)(n + 2 )/ 2 . Set k 0. Select an interpolation set Yq C B{xk,Sk) satisfying 

|>o| =Po- 

2 while true do 

3 (Compute new interpolation value estimates): For each y gY^, compute a (new) 

estimate f{y, w). 

4 (Model building): Build a quadratic model mk{xk + s) = fk + gls + H^s with 

s = X — Xk that interpolates f{y) at the points of Yk- 

5 (Step calculation): Compute Sk = arg min mk{s) (approximately) so that Sk 

satisfies (3). 

6 (Estimates calculation): Compute new estimates /)? = f{xk,iOi) and 

fk = fi^k + Sk,u}i) of f{xk) and f{xk + Sk). 

7 (Acceptance of the trial point): Compute pk = - 7 — ^ -r. 

mk{xk) - mk[xk + Sk) 

8 If Pfc > and \\gk\\ > 724, then x^+i ^ xa, + Sk; otherwise, Xk+i ^ Xk- 

9 (Trust-region radius update): If pk > 71 , 4-i-i niin{74,'^max}; otherwise 

4-1-1 7 ^4- 

10 (Interpolation set update): Augment Yk with Xk + Sk- If |Ffc| > Pmax, remove the 

point of Yk furthest from the new trust region center Xk+i- 

11 (Iterate): /c/c + 1. 
end 
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Algorithm 5: STORM for minimizing logistic loss 

1 (Initialization): Choose an initial point xq = {wq,/3o) and an initial trust-region radius 

(5o G (0, 5max) with (5max > 0. Choose constants 7 > 1, 771 G (0,1), 72 G (0, 00 ), po, pmax; Set 
k^O. 

2 while true do 

3 (Determine sample rate): Choose a sample rate pk- In our implementation, we will 

use Pk = min{pmax,max {100 * k + po, [1/5^1 }}• 

4 (Model building): Uniformly (without replacement) draw a sample Ik C A}. 

Compute a stochastic gradient gk = V (I) and stochastic Hessian 

Hk = fi^,{w, /3). Define a quadraticmodel mk{s) = g"[s + ^s'^HkS. and stochastic 

Hessian 

5 (Step calculation): Compute Sk = arg min mk{s) (approximately) so that Sk 

satisfies (3). 

6 (Estimates calculation): Draw new samples Ik,Ik and compute estimates 

fk = and /| = fidxk + Sk) of f{xk) and f{xk + Sk), respectively. 

7 (Acceptance of the trial point): Compute pk = - 7 — ^^ -r. 

mk{xk) - mk[xk + Sk) 

8 If Pfc > Vi and \\gk\\ > 724 , then Xk+i ^ Xk + Sk; otherwise, Xk+i ^ Xk- 

9 (Trust-region radius update): If pk > 71 , 4-i-i niin{74, <Imax}; otherwise 

4-1-1 7 ^4- 

10 (Iterate): k k + 1. 
end 
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